An Automated Conversion of Documents Containing Math into SGML

Authors

  • Janusz Wnek
  • Robert Price
Abstract

Intelligent document understanding (IDU) systems convert scanned document pages into an electronic format that preserves layout and logical document structure in addition to document content. Most experimental IDU systems, however, cannot fully exploit recognition results, i.e., reconstruct and use complete documents. In this paper we present an integrated IDU system that processes documents all the way from recognition to full utilization using the Standard Generalized Markup Language (SGML). SGML recognizes that document data, structure, and format are separable elements. The data in a document may include text, graphics, images, and even multimedia objects. The structure of a document refers to the relationships among the data elements. The format of a document is its appearance. An SGML document has an associated document type definition (DTD) that specifies the rules for the structure of the document. SGML preserves the data and structure but does not specify the format of the document, recognizing that format should be optimized to user requirements at the time of delivery. The standardization and widespread use of SGML-based tools provide the means for filling the gap between document recognition and seamless document reuse. The proposed system was designed with the SGML application in mind. From the early stages of document processing, the SGML components are used and incrementally constructed. Following SGML convention, in order to convert a document from a paper format to SGML, its logical structure and content have to be mapped to a DTD. To facilitate recognition of the structure, the current version of our system assumes that the paper document was prepared in accordance with some formatting guidelines. These guidelines define standard phrases, ordering, and formatting rules that produce flexible yet consistent document structure, with elements that have either direct or indirect correspondence to the DTD.
Directly mapped elements have phrases or formatting styles that correspond to SGML tags. Indirectly mapped elements require special processing and are exemplified here by tabular data and mathematical expressions. The conversion process involves OCR of a multi-page document, document structure analysis, processing of indirectly mapped elements, and generation of the final SGML description. Document structure analysis is reduced to parsing OCR results and recreating document structure by performing fuzzy searches for standard phrases and some format analysis. Tabular data processing uses OCR results with positional data, horizontal lines, and heuristic rules to determine cell boundaries and contents. Recognition of mathematical expressions involves OCR on an extended symbol set and equation structure recognition. The equation recognition engine employs a tree representation to store intermediate transformations. The transformations are ordered and involve connecting separated symbols, context-sensitive OCR correction, extraction of horizontally aligned subexpressions, subscript and superscript processing, and general processing of symbols detected above or under the target symbol. The system was tested on examples of documents in the standard format. The test cases included tabular data and mathematical equations. The results can be demonstrated using the Panorama Pro SGML browser.
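
The structure-analysis step described above (fuzzy searches for standard phrases, then mapping directly mapped elements to SGML tags) can be illustrated with a minimal sketch. It is not the authors' implementation: the phrase table, the tag names, the 0.8 similarity threshold, and the functions `best_tag` and `to_sgml` are all hypothetical, and Python's standard `difflib` stands in for whatever fuzzy search the system actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical standard phrases and the SGML tags they map to.
# A real DTD and set of formatting guidelines would define these.
STANDARD_PHRASES = {
    "abstract": "ABSTRACT",
    "introduction": "INTRO",
    "references": "REFS",
}


def best_tag(line, threshold=0.8):
    """Return the tag of the closest standard phrase, or None if no
    phrase is similar enough to the (possibly misrecognized) line."""
    text = line.strip().lower()
    best_phrase, best_score = None, 0.0
    for phrase in STANDARD_PHRASES:
        score = SequenceMatcher(None, text, phrase).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_score >= threshold:
        return STANDARD_PHRASES[best_phrase]
    return None


def to_sgml(ocr_lines):
    """Wrap body lines in elements opened by fuzzy-matched headings.
    Text is emitted as-is; escaping of markup characters is omitted."""
    out, open_tag = [], None
    for line in ocr_lines:
        if not line.strip():
            continue  # skip blank OCR lines
        tag = best_tag(line)
        if tag is not None:
            if open_tag:
                out.append(f"</{open_tag}>")  # close the previous section
            out.append(f"<{tag}>")
            open_tag = tag
        else:
            out.append(line.strip())
    if open_tag:
        out.append(f"</{open_tag}>")
    return "\n".join(out)


if __name__ == "__main__":
    # "Abstrnct" and "lntroduction" imitate typical OCR errors.
    page = ["Abstrnct", "We present a system.",
            "lntroduction", "Scanned pages are noisy."]
    print(to_sgml(page))
```

A similarity threshold tolerates single-character OCR errors in headings while leaving ordinary body sentences unmatched; the right value would depend on the phrase set and the OCR error rate.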

Similar resources

Another Look at LaTeX to SGML Conversion

Publishers are starting to use SGML as their permanent form of storage for documents. How can LaTeX files be converted to an SGML instance? This paper discusses possible strategies, and describes an implementation by Elsevier Science of a system based on conversion in TeX itself, and extraction of SGML code from the dvi file.

Extending SGML to Accommodate Database Functions: A Methodological Overview

A method for augmenting an SGML document repository with database functionality is presented. SGML [ISO 8879, 1986] has been widely accepted as a standard language for writing text with added structural information that gives the text greater ... (Partially supported by US Dept. of Education award number P200A502367 and NSF Research and Infrastructure grant, award number NSF CDA-9303189.)

The Structured Information Manager: A Database System for SGML Documents

One of the important standards for document interchange and representation that has emerged is SGML, the Standard Generalized Markup Language. SGML is designed to capture the logical structure of documents, i.e. the logical components such as titles and paragraphs and their interrelationships. SGML is a complex standard, and the design of a database system for managing SGML documents poses many...

Structured storage and retrieval of SGML documents using Grove

SGML, standardized in ISO 8879 [International Organization for Standardization (1986)], has proliferated because it can provide various styles and transform documents on different platforms. An SGML document has logical structure information in addition to its contents. As SGML documents are widely used, there is an increasing demand for a storage and retrieval system to use the logical stru...

A System for Assembling Specialized Textbooks from a Pool of Documents

We consider assembling specialized, customized textbooks from a large collection of SGML documents. Our prototype assembly framework allows the user to select parts of the documents in the collection and to form a new structured document. A user's order is processed in the following way. (1) The user fills in and submits an HTML form. (2) The form is processed and the parts to be included in...

Publication date: 2007